Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
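The sensitivity analysis mentioned above relates downstream performance to speech recognition accuracy. The abstract does not name the accuracy metric; a common choice is word error rate (WER), and the following is a minimal sketch under that assumption. The function name `word_error_rate` is illustrative and not taken from the benchmark's code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Plotting a pipeline model's task score against the WER of each recognizer is one way such a sensitivity curve could be obtained.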
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.
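Procedure learning involves identifying the key-steps of a task and determining the logical order in which to perform them. Existing approaches commonly learn the procedure from third-person videos, in which the manipulated object appears small and is often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. However, procedure learning from egocentric videos is challenging because (a) the camera view undergoes extreme changes due to the wearer's head motion, and (b) unrelated frames are present due to the unconstrained nature of the videos. As a result, the assumptions made by current state-of-the-art methods, namely that the actions occur at approximately the same time and have the same duration, do not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state of the art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively. Furthermore, for procedure learning using egocentric videos, we propose the EgoProceL dataset, consisting of 62 hours of videos captured by 130 subjects performing 16 tasks. The source code and the dataset are available on the project page https://sid2697.github.io/egoprocel/.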
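End-to-end (E2E) models are becoming increasingly popular for spoken language understanding (SLU) systems and are beginning to achieve performance competitive with pipeline-based approaches. However, recent work has shown that these models struggle to generalize to new phrasings of the same intent, indicating that they cannot understand the semantic content of the given utterance. In this work, we incorporate language models pre-trained on unlabeled text data inside E2E-SLU frameworks to build strong semantic representations. Incorporating both semantic and acoustic information can increase the inference time, leading to high latency when deployed in applications such as voice assistants. We develop a 2-pass SLU system that makes a low-latency prediction in the first pass using acoustic information from the first few seconds of the audio, and makes a higher-quality prediction in the second pass by combining semantic and acoustic representations. We take inspiration from prior work on 2-pass end-to-end speech recognition systems, which attend to both the audio and the first-pass hypothesis using a deliberation network. The proposed 2-pass SLU system outperforms the acoustic-based SLU model on the Fluent Speech Commands challenge set and the SLURP dataset and reduces latency, thus improving user experience. Our code and models are publicly available as part of the ESPnet-SLU toolkit.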
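The landscape of privacy laws and regulations around the world is complex and ever-changing. National and supranational laws, agreements, decrees, and other government-issued rules form a patchwork that companies must follow in order to operate internationally. To examine the status and evolution of this patchwork, we introduce the Government Privacy Instructions corpus, or GPI Corpus, of 1,043 privacy laws, regulations, and guidelines covering 182 jurisdictions. This corpus enables large-scale quantitative and qualitative examination of legal foci. We examine the temporal distribution of when GPIs were created and illustrate the dramatic increase in privacy legislation over the past 50 years, although a finer-grained examination reveals that the rate of increase depends on the personal data types that the GPIs address. Our exploration also shows that most privacy laws each address relatively few personal data types, indicating that comprehensive privacy legislation remains rare. Additionally, topic modeling results show the prevalence of common themes in GPIs, such as finance, healthcare, and telecommunications. Finally, we release the corpus to the research community to promote further study.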
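In attempts to "explain" the predictions of machine learning models, researchers have proposed hundreds of techniques for attributing predictions to features that are deemed important. While these attributions are often claimed to hold the potential to improve human "understanding" of the models, surprisingly little work explicitly evaluates progress towards this aspiration. In this paper, we conduct a crowdsourcing study in which participants interact with deception detection models trained to distinguish between genuine and fake hotel reviews. They are challenged both to simulate the model on fresh reviews and to manipulate reviews with the goal of lowering the probability of the originally predicted class. Successful manipulations would lead to an adversarial example. During the training (but not the test) phase, input spans are highlighted to communicate salience. Through our evaluation, we observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the test phase than the no-explanation control. For the BERT-based classifier, popular local explanations do not improve participants' ability to reduce the model confidence over the no-explanation case. Remarkably, when the explanation for the BERT model is given by the (global) attributions of a linear model trained to imitate the BERT model, people can effectively manipulate the model.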
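To make the linear bag-of-words setting above concrete, the following toy sketch shows how knowing feature coefficients can guide an edit that lowers the probability of the predicted class. The vocabulary, weights, bias, and function names are all invented for illustration; they are not taken from the study's models or data.

```python
import math

# Hypothetical coefficients: positive values push the prediction toward "fake".
WEIGHTS = {"luxury": 1.2, "amazing": 0.9, "husband": 0.7,
           "location": -0.8, "bathroom": -1.1}
BIAS = -0.2


def fake_probability(review: str) -> float:
    """Logistic model over word counts: P(fake | review)."""
    score = BIAS + sum(WEIGHTS.get(tok, 0.0) for tok in review.lower().split())
    return 1.0 / (1.0 + math.exp(-score))


def remove_top_feature(review: str) -> str:
    """Delete the word contributing most toward the currently predicted class."""
    tokens = review.lower().split()
    sign = 1.0 if fake_probability(review) >= 0.5 else -1.0
    contribution = {}
    for tok in tokens:
        contribution[tok] = contribution.get(tok, 0.0) + sign * WEIGHTS.get(tok, 0.0)
    top = max(contribution, key=contribution.get)
    return " ".join(tok for tok in tokens if tok != top)


review = "amazing luxury hotel near great location"
edited = remove_top_feature(review)
before, after = fake_probability(review), fake_probability(edited)
```

Here `after` is lower than `before` because the edit removes the word with the largest coefficient in the direction of the predicted class, which is the kind of manipulation that access to global coefficients makes easy.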
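As automatic speech recognition (ASR) systems improve, there is growing interest in using ASR output for downstream natural language processing (NLP) tasks. However, few open-source toolkits are available for generating reproducible results on different spoken language understanding (SLU) benchmarks. Hence, there is a need for an open-source standard that enables a faster start in SLU research. We present ESPnet-SLU, which is designed for quick development of spoken language understanding in a single framework. ESPnet-SLU is a project inside the end-to-end speech processing toolkit ESPnet, a widely used open-source standard for various speech processing tasks such as ASR, text-to-speech (TTS), and speech translation (ST). We enhance the toolkit to provide implementations for various SLU benchmarks, enabling researchers to seamlessly mix and match different ASR and NLU models. We also provide pretrained models with intensively tuned hyperparameters that can match or even outperform the current state-of-the-art performance. The toolkit is publicly available at https://github.com/espnet/espnet.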
Research has shown that climate change creates warmer temperatures and drier conditions, leading to longer wildfire seasons and increased wildfire risks in the United States. These factors have in turn led to increases in the frequency, extent, and severity of wildfires in recent years. Given the danger posed by wildland fires to people, property, wildlife, and the environment, there is an urgency to provide tools for effective wildfire management. Early detection of wildfires is essential to minimizing potentially catastrophic destruction. In this paper, we present our work on integrating multiple data sources in SmokeyNet, a deep learning model using spatio-temporal information to detect smoke from wildland fires. Camera image data is integrated with weather sensor measurements and processed by SmokeyNet to create a multimodal wildland fire smoke detection system. We present our results comparing performance in terms of both accuracy and time-to-detection for multimodal data vs. a single data source. With a time-to-detection of only a few minutes, SmokeyNet can serve as an automated early notification system, providing a useful tool in the fight against destructive wildfires.
Drawing from the resources of psychoanalysis and critical media studies, in this paper we develop an analysis of Large Language Models (LLMs) as automated subjects. We argue the intentional fictional projection of subjectivity onto LLMs can yield an alternate frame through which AI behaviour, including its productions of bias and harm, can be analysed. First, we introduce language models, discuss their significance and risks, and outline our case for interpreting model design and outputs with support from psychoanalytic concepts. We trace a brief history of language models, culminating with the releases, in 2022, of systems that realise state-of-the-art natural language processing performance. We engage with one such system, OpenAI's InstructGPT, as a case study, detailing the layers of its construction and conducting exploratory and semi-structured interviews with chatbots. These interviews probe the model's moral imperatives to be helpful, truthful and harmless by design. The model acts, we argue, as the condensation of often competing social desires, articulated through the internet and harvested into training data, which must then be regulated and repressed. This foundational structure can however be redirected via prompting, so that the model comes to identify with, and transfer, its commitments to the immediate human subject before it. In turn, these automated productions of language can lead to the human subject projecting agency upon the model, effecting occasionally further forms of countertransference. We conclude that critical media methods and psychoanalytic theory together offer a productive frame for grasping the powerful new capacities of AI-driven language systems.
This work presents a physics-informed deep learning-based super-resolution framework to enhance the spatio-temporal resolution of the solution of time-dependent partial differential equations (PDEs). Prior works on deep learning-based super-resolution models have shown promise in accelerating engineering design by reducing the computational expense of traditional numerical schemes. However, these models rely heavily on the availability of high-resolution (HR) labeled data during training. In this work, we propose a physics-informed deep learning-based framework to enhance the spatial and temporal resolution of coarse-scale (both in space and time) PDE solutions without requiring any HR data. The framework consists of two trainable modules that independently super-resolve the PDE solution, first in the spatial and then in the temporal direction. The physics-based losses are implemented in a novel way to ensure tight coupling between the spatio-temporally refined outputs at different times and to improve framework accuracy. We analyze the capability of the developed framework by investigating its performance on an elastodynamics problem. It is observed that the proposed framework can successfully super-resolve (both in space and time) the low-resolution PDE solutions while satisfying physics-based constraints and yielding high accuracy. Furthermore, the analysis and the obtained speed-up show that the proposed framework is well suited for integration with traditional numerical methods to reduce computational complexity during engineering design.
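One common way to build a physics-based loss that needs no high-resolution labels is to penalize the discretized PDE residual of a candidate solution. The abstract does not specify the paper's actual loss, so the following is only a minimal sketch under stated assumptions: a 1D scalar wave equation u_tt = c^2 u_xx stands in for the elastodynamics problem, derivatives use central finite differences, and the name `wave_residual_loss` is hypothetical.

```python
import math


def wave_residual_loss(u, dx, dt, c):
    """Mean squared residual of u_tt - c^2 * u_xx over interior grid points.

    u[n][i] is the field value at time step n and spatial node i.
    """
    if len(u) < 3 or len(u[0]) < 3:
        raise ValueError("need at least 3 time steps and 3 spatial nodes")
    total, count = 0.0, 0
    for n in range(1, len(u) - 1):
        for i in range(1, len(u[n]) - 1):
            u_tt = (u[n + 1][i] - 2.0 * u[n][i] + u[n - 1][i]) / dt ** 2
            u_xx = (u[n][i + 1] - 2.0 * u[n][i] + u[n][i - 1]) / dx ** 2
            total += (u_tt - c ** 2 * u_xx) ** 2
            count += 1
    return total / count


# A travelling wave sin(x - c t) satisfies the PDE; adding a static ripple
# that does not propagate violates it.
c, dx, dt, nx, nt = 2.0, 0.1, 0.01, 21, 21
exact = [[math.sin(i * dx - c * n * dt) for i in range(nx)] for n in range(nt)]
perturbed = [[exact[n][i] + 0.1 * math.sin(5.0 * i * dx) for i in range(nx)]
             for n in range(nt)]
loss_exact = wave_residual_loss(exact, dx, dt, c)
loss_perturbed = wave_residual_loss(perturbed, dx, dt, c)
```

The residual loss is close to zero for the field that obeys the equation and much larger for the perturbed one, which is what allows such a term to supervise a super-resolved output without reference data.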